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Abstract 

Background: Due to the recent rapid development in ChlP-seq technologies, which uses high-throughput next- 
generation DNA sequencing to identify the targets of Chromatin Immunoprecipitation, there is an increasing 
amount of sequencing data being generated that provides us with greater opportunity to analyze genome-wide 
protein-DNA interactions. In particular, we are interested in evaluating and enhancing computational and statistical 
techniques for locating protein binding sites. Many peak detection systems have been developed; in this study, we 
utilize the following six: CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator. 

Results: We define two methods to merge and rescore the regions of two peak detection systems and analyze 
the performance based on average precision and coverage of transcription start sites. The results indicate that 
ChlP-seq peak detection can be improved by fusion using score or rank combination. 

Conclusion: Our method of combination and fusion analysis would provide a means for generic assessment of 
available technologies and systems and assist researchers in choosing an appropriate system (or fusion method) for 
analyzing ChlP-seq data. This analysis offers an alternate approach for increasing true positive rates, while 
decreasing false positive rates and hence improving the ChlP-seq peak identification process. 



Background 

Introduction 

One of the most important biotechnologies developed in 
the 20* century is the Sanger method for the sequencing 
of DNA [1]. Recently developed next-generation DNA 
sequencing (NGS) technologies have increased DNA 
sequencing capacity by many orders of magnitude, making 
entirely new applications possible [2,3]- Chromatin Immu- 
noPrecipitation (ChIP) is a biochemical method to iden- 
tify binding sites on DNA that interact with proteins. It 
involves cross-linking proteins to DNA with a reagent 
such as formaldehyde, randomly shearing the DNA into 
small fragments (200-500 base pairs) (fragmentation), 
then using an antibody specific for a known DNA-inter- 
acting protein to isolate DNA fragments bound to the 
target protein [4] (immunoprecipitation). 
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Genomics 



The combination of the ChIP process and microarray 
DNA chip technologies lead to the method of Chip-on- 
chip [5] or ChlP-chip [6] that can identify DNA frag- 
ments isolated by ChIP using a DNA microarray contain- 
ing large numbers of probes of known genomic sequences. 
ChlP-seq [7] uses next-generation sequencing (NGS) 
to identify the DNA fragments isolated by ChIP. Next- 
generation DNA sequencing machines are capable of 
simultaneously determining the sequences of millions of 
DNA fragments in a single sample with a high degree of 
accuracy (high-throughput sequencing). The sequence 
reads (known as tags) obtained from ends of ChlP-selected 
DNA fragments are typically 25-50 base pairs long. These 
short reads can then be mapped to a reference genome by 
a stringent DNA sequence alignment algorithm such as 
ELAND (Illumina Inc.), MAQ [8], or Bowtie [9] (map- 
ping). Sequence reads that do not map to a unique posi- 
tion on the genome (with 2 or fewer mismatches) are 
generally discarded. The final product of such a mapping 
procedure is a set of positions on the reference genome 



Schweikert et al. BMC Genomics 2012, 13(Suppl 8):S12 
http://www.biomedcentral.com/1471-2164/13/S8/S12 



Page 2 of 12 



indicating the start and end of each short sequence read. 
Once the reads are mapped to the genome, the tag posi- 
tions can then be analyzed for clusters of tags or "peaks", 
which indicate (predict) protein binding (or histone modi- 
fication) positions enriched by the ChIP (peak detection). 
The results of ChlP-seq studies can provide an unbiased 
genome-wide profile of DNA regulatory regions targeted 
by transcription factors as well as the signatures of modi- 
fied histone proteins associated with epigenetic changes in 
chromatin. Figure 1 shows a framework for a ChlP-seq 
experiment and analytic workflow [10]. 

Peak detection is the last and probably most crucial 
and dynamic step in the process of the ChlP-seq method 
and system after fragmentation, immunoprecipitation, 
sequencing, and mapping. Along the pipeline, the set of 
mapped sequence tags can easily acquire noise from 
background contamination, co-precipitation of unbound 
DNA fragments, non-specific interactions of the ChIP 
target protein with DNA, and a variety of sources such as 
replication and amplification artifacts (e.g. PCR artifacts). 
A useful ChlP-seq peak detection technique or tool has 
to be robust and reliable. With the rising popularity and 
increasing importance of ChlP-seq, there has been a pro- 
liferation of new analytical and computational methods 
to find peaks in ChlP-seq data. At the last count, there 



are over 30 open source programs, in addition to many 
commercial software applications, available to the 
research community [11]. 

The first step in the peak detection process is to identify 
those genomic regions with a large number of mapped 
sequence tags (enriched regions) [12-24]. Then the peak 
detection and identification system must determine the 
number of tags (peak heights) or directionality score (tag 
count) that constitutes enrichment "significant" enough to 
represent a protein-DNA binding site. In this way, tag 
count (T) is a scoring function in which the system assigns 
a number to each possible region. Often, a tag count 
threshold is chosen to define a peak [24]. One way to set 
this threshold is to compare the distribution of tags in 
enriched regions to tags that are placed randomly on the 
genome. The outcome is a significance value (p-value) of 
the sequence tag enrichment. This value (P) is also a scor- 
ing function used to select peaks [16,17,20,22]. Some 
methods use sequence data from a control dataset and 
then use the control tag densities to assess the significance 
of peaks in the ChIP sample set. In this case, a fold 
enrichment (F) ratio of ChIP tag count over the normal- 
ized control tags in the candidate regions is calculated to 
give another scoring function [7,14,17,25]. Different meth- 
ods use various statistical models to assess the significance 
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Figure 1 ChlP-seq experiment and analytic workflow. 
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of Chip peaks or assign a false discovery rate (FDR) to 
each peak with respect to control data [12,16,18-26]. 

Recently, Pepke et al [27] gave a review of the major 
steps in ChlP-seq analysis and described the algorithmic 
approaches of twelve existing programs for detecting 
peaks. Laajala et al [28] gave some metrics for evaluating 
various methods of peak detection. Wilbanks and Fac- 
ciotti [11] compared the performance of eleven different 
peak calling programs on common empirical, transcrip- 
tion factor data sets. Their work offers a variety of ways 
to assess the performance of each algorithm and address 
the questions as to how to select the most suitable 
among several available methods of ChlP-seq analysis for 
peak detection. In our study, we evaluate six methods: 
CisGenome, MACS, PeakSeq, QuEST, SISSRs, and 
TRLocator [16,17,23,24,26,29] based on the three attri- 
butes: tag count, p-value, and fold change, and their com- 
binations. We then analyze the merged results of all two- 
method combinations. In particular, a recently developed 
information fusion method. Combinatorial Fusion Analy- 
sis [30], is utilized to demonstrate that ChlP-seq peak 
detection can be improved by fusion using score or rank 
combination. Our study offers an alternate approach to 
select a suitable method for ChlP-seq analysis. This study 
also offers ways to improve existing methods by combin- 
ing them in an appropriate way using Combinatorial 
Fusion Analysis. 

Based on preliminary experiments, we have observed 
that the peak-detection abilities of available ChlP-seq 
methods and systems vary greatly depending on the type 
of protein that is targeted by the antibody used in the 
Chip. We have identified three types of protein-DNA 
interactions that generate very different results when the 
same peak detection system is used to analyze the ChlP- 
seq data [10]. The first observation is that transcription 
factors, such as E2F4, bind strongly to a single highly spe- 
cific DNA sequence (a motif) near the transcription start 
site (TSS) of a gene, and are characterized by distinct 
ChlP-seq peaks -500 bases wide, with oriented tags that 
approximately follow a normal distribution. A second 
observed pattern is with transcription factors, such as 
Sin3a, that bind weakly to DNA together with co-factors, 
yielding wider ChlP-seq peaks (800-1600 bases) with a 
flat distribution of lower tag density and un-oriented 
tags. A third kind of ChlP-seq target, modified histone 
proteins, such as tri-methylated H3K4, produce much 
wider peaks (~4000 bases) and un-oriented tags [10]. In 
this study, we use a trimethylated H3K4 (H3K4me3) data 
set [31]. 

Previous work 

Similar to the analysis of microarray gene expression 
data, many computational methods have recently been 
developed for the analysis of ChlP-seq data. In both 



cases, the proliferation of software and systems was an 
indication that it is difficult to find a single well-validated 
method that performs well in a variety of domain appli- 
cations. It also depends on what criteria one uses to eval- 
uate the systems. In this study, we use the following six 
methods and systems to analyze their intra- and inter- 
system properties and improvement by combination. 
They are (A) CisGenome [16], (B) MACS [24], (C) Peak- 
Seq [26], (D) QuEST [23], (E) SISSRs [17], and (F) TRLo- 
cator [29]. 

CisGenome [16] uses a two-pass algorithm for peak 
detection to ensure adjustment for DNA fragmentation 
length. It can analyze both ChlP-seq and ChlP-chip data, 
or combine the two. In order to correct many types of 
systemic bias created by sample preparation, amplifica- 
tion, sequencing (or hybridization), and alignment, it 
uses both a ChIP sample and a negative control sample 
(input DNA or mock-ChIP with IGG) to compute FDR 
at each specific location. It also provides methods to 
detect binding regions, peak localization, and filtering. 

QuEST [23] provides a data-driven statistical analysis 
model to generate peak calls by leveraging the key attri- 
butes of the sequenced and aligned DNA reads, such as 
directionality (strand orientation) and the original size of 
ChlP-isolated DNA fragments. The statistical framework 
used is the kernel density probability estimation approach, 
which facilitates the aggregation of signals originated from 
densely packed sequence reads at protein interaction sites. 

MACS (Model-based Analysis of ChlP-Seq) [24] empiri- 
cally models the shift size of ChlP-seq tags to enhance 
peak identification by taking advantage of the bimodal pat- 
tern of forward and reverse tags. MACS also utilizes a 
dynamic Poisson distribution to identify local biases in the 
genome. 

Site Identification from Short Sequence Reads (SISSRs) 
[17] estimates high read counts using Poisson probabil- 
ities and calls regions where the peaks shift from the for- 
ward to the reverse strand. The SISSRS method is 
attractive because it explicitly makes use of information 
from the orientation of tags around a protein binding site 
- where it is expected that forward strand tags will be 
found upstream of the true binding site and reverse 
strand tags downstream. This allows for very precise pre- 
diction of the actual binding site. However, for regions of 
low tag density or for histone methylation ChIP, where 
tags are not neatly oriented, it tends to create many dif- 
ferent peaks across enriched regions, which may not be 
reproducible across replicates. 

PeakSeq [26] utilizes input-DNA control data to refine 
the selection and scoring of peak regions in Chip-seq 
experiments to improve the identification of transcription 
factor binding sites. Since it has been observed that signal 
peaks in the control data are highly correlated with poten- 
tial binding sites, PeakSeq compensates for this signal, 
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caused by open chromatin structure, with a two-pass strat- 
egy. PeakSeq first identifies enriched peaks in the Chip-seq 
data as candidate regions. These putative regions are then 
compared to the normahzed control and the regions that 
are significantly enriched with mapped sequence tags rela- 
tive to the control are identified as binding sites. 

TRLocator [29] is a peak detection method that has 
been developed at NYU-CHIBI. The algorithm utilizes 
the distribution of the background data to compute p- 
values for putative peaks in the ChlP-seq data. Putative 
peak regions are generated based on a variable merging 
window size that can be adjusted according to the kind 
of data set being analyzed. Custom filters for finding 
qualified peak regions include: p-value, minimum num- 
ber of tags within each putative peak, balance between 
the number of tags aligned to the positive strand and 
the number of tags aligned to the negative strand, and 
the log2 ratio between ChIP tags and background tags. 

Methods 

Combining peak detection systems 
Multiple scoring systems 

We propose that the peak detection for each of the bind- 
ing sites be viewed as a scoring system on the set of all 
possible binding site regions. Different scoring systems 
for peak detection can represent different features/cues/ 
attributes or different algorithms/methods/systems. They 
can also represent different technical replicates or differ- 
ent biological replicates using each of the same set of fea- 
tures or cues/attributes or the same algorithm or 
method/system. By using multiple scoring systems 
defined on the set of possible binding site regions to 
detect peaks for each of the binding sites, we can study 
the reproducibility of peak calls among different repli- 
cates. We also use multiple scoring systems to develop 
and design new algorithms with greater accuracy, effi- 
ciency, and scalability for detecting protein binding sites 
in ChlP-seq data alignment. We draw from recent 
research in combinatorial fusion [32,33]. Using a rank- 
score characteristic graph to measure the scoring diver- 
sity [34], combinatorial fusion has been an active 
research area in the past ten years in a variety of applica- 
tion domains such as microarray gene expression analysis 
[35], motif finding [36], protein structure prediction [37], 
virtual screening [38], information retrieval [39,33], and 
target tracking [40]. 

In our preliminary work, we analyzed the six individual 
systems according to three features, which include: tag 
count, p-value, and fold change (enrichment of ChIP tags 
compared to background control tags at the same geno- 
mic locus) [41]. We analyzed these features and their 
combinations according to average precision and 
observed that, in most cases, the tag count feature out- 
performed other features and combinations of features. 



Since tag count was the most consistent and best per- 
forming feature between the methods, we choose to use 
the Chip tag count as the score function to represent 
each method's scoring of the regions identified. Let D = 
{di, d2,—, dj be the set of regions identified by system x 
and the score function s^(d) be the tag count of that 
region (number of ChIP tags in the data set that are 
located within that chromosomal region). Let the rank 
function r;^(d) be the function from D to N = [1, n] = 
{1,2,..., n} which is obtained by sorting the values in Sy^(d) 
into descending order and converting the function Sx(i50 
into the function r-J^^d) using the rank as its function 
value. 

Combining two peak detection systems 
Union 

The union of two systems, x and y, U(x, y) is the set of 
regions that contains all regions identified by x and all 
regions identified by y, where overlapping regions between 
the two methods are merged together to form new merged 
regions. All non-overlapping regions that belong to either 
x or y will maintain their genomic positions (chromosome, 
start and end bp coordinates). Each merged region will 
have a start position that is the minimum of all start posi- 
tions of its overlapping regions from x and y, and an end 
position that is the maximum of all end positions of those 
overlapping regions. This new set of regions, U(x, y) = {di, 
d2, :., dp}, is scored based on the tag counts of systems x 
and y, as follows. Systems x and y have new score func- 
tions based on the regions in this union: s^{d) and Syid). 
s,;{d) is obtained according to the following: 
Single regions - if the region was identified by system x, 
the score is the tag count given by x; otherwise the score 
isO. 

Merged regions - the score is the sum of the tag counts 
for the regions (that are part of this merged region) that 
were identified by x. 

Syid) is obtained in the same manner. The score func- 
tions are then scaled from 0 to 1 by the following normali- 
zation: score function s^{d): U(x, y) ^ R\s transformed to 

Sx* * {d) :U [x, y) [0, 1] where, 5x' * (d) = ^i-i^ — £inm^ 

^max ^min 

Smax = max{ s-^-{d): de U(x, y)}, and = min{ s^-{d): 
de U(x, y)}. Syid) is also normalized accordingly. The 
rank functions r^cid) and ry{d) from U(x, y) to N = {1, 2, 
pj assign a rank to each region after sorting the 
scores given by Sy^'{d) and Sy'{d) in descending order, 
respectively. In order to provide a single score and rank 
for each region in U(x, y) that is based on combined 
information from systems x and y, we perform score 
and rank combinations. The score combination for the 
union of systems x and y is defined as: 
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^u(x,y) (d) = - (v * (d) + 5y' * (d)) and the rank com- 
bination is computed according to: 

'■u(x,y) (d) = J (rx' (d) + ry (d)) . 
Intersection 

The intersection of two systems, x and y, I(x, y) is the 
set of the merged regions formed by overlapping regions 
of system x and y. I(x, y) £ U(x, y) where I(x, y) = {i L U 
(x, y): i is a merged region that contains overlapping 
regions from both systems x and y}, giving the set I(x, y) 
= {di, d2, ■:, dqj. The regions belonging to the intersec- 
tion are scored in the same way merged regions are 
scored in the union. The score functions for systems x 
and y, based on their intersection, Sx'((/) and Sy(<i), assign 
a score to each of the merged regions that is the sum of 
the tag counts for the regions identified by x or y that 
are part of this merged region. 

s^'{d) and Sy'(<i) are then normalized to the scale [0,1] 
(as described above) to give s^''{d) and Sy'''(<i). The 
regions of the intersection are ranked according to 
their score (descending order) to give rank functions 
r^'{d) and ryid). Similar to the case of union, score and 
rank combinations for the intersection of systems x 
and y are computed. The score and rank combinations 

are defined as: 5i(-x,y) (d) = ^ (sx' * (d) + Sy * (d)) and 

ri(x,y) id) = J (rx' id) + ry (d)), respectfully. 

Example from H3K4 data set: The visualization in Fig- 
ure 2 shows peaks identified by all individual methods, 
along with the TSS regi2on, near the ARRDC4 gene 
(ARRDC4; Chromosome: 15; 96,304,937-96,318,072, 
UCSC Genome Browser Mar. 2006 assembly). Figure 3 
demonstrates the intersection and union of the PeakSeq 
and QuEST methods in the area depicted above. The 
intersection contains the merged regions that are formed 
by overlapping regions between the two methods. The 
union contains these merged regions and all non-over- 
lapping regions of the individual methods. 



Performance evaluation methods 
Average precision 

For many transcription factors, DNA polymerase II, and 
some modified histones such as tri-methylated H3K4, 
the majority of binding sites are located near the tran- 
scription start sites (TSS) of expressed genes. Therefore, 
it is possible to evaluate ChlP-seq software systems, and 
different combination methods, by their ability to locate 
peaks at a TSS. While not all true peaks are located at a 
TSS, not all TSS are correctly annotated in the reference 
genome, and not all true TSS have such a peak, the 
ratio of peaks located at an annotated TSS vs. those 
located elsewhere on the genome is a measure of preci- 
sion of the peak finding method. We have validated this 
concept by visualizing all aligned tags on the genome 
without first identifying peaks. Peaks can be observed in 
the vicinity of most TSS annotated in the RefSeq data- 
base. An average peak can be visualized by superimpos- 
ing the coverage depth of sequence reads for DNA 
regions within 1000 bases flanking all annotated RefSeq 
TSS (Figure 4). No TSS peak is found in control DNA. 

In this evaluation, we compare the peaks identified by 
a particular system (or combination of two systems) 
against the set of RefSeq TSS in the human genome. 
Average precision is used to evaluate the performance 
of systems and the result of fusion. A region is consid- 
ered relevant if it overlaps with a TSS in the annotated 
set. We define the following overlap function for a 
region at rank i: 

II, region overlaps with a TSS 
0, otherwise 

Precision at rank r is computed as: 
toii) 

p(r) = 

Average precision for a system that identifies n 
regions is defined as: 
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Figure 2 Regions of individual methods upstream of the ARRDC4 gene 
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Figure 3 Example of the intersection and union between PeakSeq and QuEST regions 
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Systems that have more regions in lower ranks that 
overlap with a TSS will have higher average precision. 
Coverage 

When evaluating the performance of a peak finding sys- 
tem, it is also important to consider its coverage, in 
terms of the breadth of TSS covered by its peaks 
(regions). Given a set of regions identified by a system 



or combination method, we generate the set of TSS that 
overlap with these regions; the coverage (C) is the num- 
ber of unique TSS reached by the system: 

C = I {TSS that overlap with region (s) of system\ | . 



Results 

System fusion and evaluation 

The Chip tag count for a region is used as a score func- 
tion to create all 2-combinations of the six systems: 




Figure 4 ChlP-seq tags from an immunoprecipitation with antibody for H3K4 and an IGG control The graph shows the total number of 
tag start positions mapped to each basepair within 1000 bp flanl<ing all annotated RefSeq TSS. Tags mapped to the forward strand are shown 
in blue and the reverse strand in green. The graph also shows a very clear nucleosome depleted region located exactly at the TSS. 
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CisGenome, MACS, PeakSeq, QuEST, SISSRs, and 
TRLocator. We perform two kinds of combination: 
intersection and union (see Methods section). The inter- 
section of two systems is expected to improve specificity 
(detected by both systems) while the union is expected 
to improve sensitivity (detected by either system). When 
evaluating each system or combination of two systems, 
we use average precision and coverage (see Methods 
section). These results are listed in Tables 1, 2, 3, 4, 5, 6 
with corresponding diagrams in Figures 5, 6, 7, 8. 

Tables 1, 2, and 3 list the average precision for the six 
individual systems and all fifteen 2-combinations by inter- 
section (*) and union (+). The six individual systems in 
order of performance according to average precision are: 
A = TRLocator, B = MACS, C = PeakSeq, D = QuEST, 
E = CisGenome, and F = SISSRs. In Table 2, it can be 
observed that all 2-combinations by intersection are posi- 
tive cases, which means its performance is better than or 
equal to the best of the two individual systems. Each of 
the two combinations: A*E between TRLocator (A) 
(ranked #1) and CisGenome (E) (ranked #5) and A* F 
between TRLocator (A) (ranked #1) and SISSRs (F) 
(ranked #6) is better than the three 2-combinations A*B, 
A*C, and A*D that involve TRLocator (A) (ranked #1), 
MACS (B) (ranked #2), PeakSeq (C) (ranked #3), and 
QuEST (D) (ranked #4). Moreover, each of the 2-combi- 
nations B*E and B*F is better than the other 2-combina- 
tions B*C and B*D. This phenomenon is quite interesting 
- individual systems such as CisGenome (E) (ranked #5) 
and SISSRs (F) (ranked #6), which are lesser preferred, can 
be combined with other systems (in this case, with TRLo- 
cator (A) (ranked #1)) to outperform other system combi- 
nations. Almost all of the 2-combinations by union (+) in 
Table 3 are negative cases - the performance of the 
2-combination is less than the best performance of the 
single cases - except for the 2-combination of B+C 
(MACS and PeakSeq). It is also interesting to note that the 
three 2-combinations A+C, A+D, and A+E are better than 
the 2-combination A+B, reflecting the same phenomenon 
observed in Table 2. 

Tables 1, 2, 3, 4, 5, 6 list the four cases of inter-system 
fusion: average precision for 2-combinations by intersec- 
tion and by union; and coverage for 2-combinations by 

Table 1 Average precision for single methods. 



Method Average precision; ranl< Number 

of regions 



F = SISSRs 


0.8212; 6 


20715 


E = CisGenome 


0.8277; 5 


21190 


D = Quest 


0.8281; 4 


21514 


C = PeakSeq 


0.8634; 3 


20000 


B = MACS 


0.9023; 2 


19918 


A = TRLocator 


0.9217; 1 


19573 



intersection and by union. The huge difference between 
average precision of the intersection and union is that 
the former has all the positive 2-combination cases, 
while the latter has all (but one) negative cases. Com- 
paring Table 5 and Table 6 we find that each of the 
unions of two systems in Table 6 has higher coverage 
than those of the intersections of two systems in Table 
5. Another difference is that in Table 5, 2-combinations 
C*D, C*E, and D*E move up to the second, third, and 
first ranks, while in Table 6, the 2-combinations invol- 
ving CisGenome (C) (ranked #3) and SISSRs (F) (ranked 
#6), such as A+C, C+F, B+C, A+F, and B+F move up to 
the top five rankings. 

Discussion 

Evaluation of peak detection systems involves analyzing 
the regions identified as peaks according to criteria such 
as the average precision and TSS coverage. 

Average precision measures the performance of a sys- 
tem according to higher scoring regions overlapping 
with a TSS. The intersection of two methods refers to 
the set of regions formed by extracting overlapping 
regions between two methods and merging them to 
form new regions. This set of regions represents the 
common peaks detected by both systems. The average 
precision of all 15 2-combinations improved when the 
intersection was evaluated. Combination by union only 
produced one result that improved average precision, 
MACS and PeakSeq. 

When evaluating system combination according to TSS 
coverage, we refer to the number of unique TSS regions 
reached. When using the method of union to combine, 
all 15 2-combinations show improvement from both ori- 
ginal systems. The result of combining two methods by 
union includes all overlapping regions that are then 
merged (intersection), in addition to all other regions 
belonging to each individual method. Some combinations 
show more improvement than others, which indicates 
that regions generated by those 2 systems are more 
diverse in terms of region location. For example, the 
regions identified by CisGenome overlap with 14010 
unique TSS, and the coverage of PeakSeq is 15611. The 
combination of CisGenome and PeakSeq by union yields 
results that have a coverage of 21738, which means the 
combined result reaches many more TSS. Another exam- 
ple is for MACS and TRLocator, which individually have 
similar performance for coverage, 11804 and 11850, 
respectively. However, the combination of MACS and 
TRLocator by union greatly improves the performance 
and now reaches 20127 unique TSS; this demonstrates 
the diversity of the two systems. When using the method 
of intersection for system combination, 4 out of 15 com- 
binations outperformed their component individual sys- 
tems. Since the intersection consists of the merged. 
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Table 2 Average precision for the intersection {*] of two methods. 

X * y Average precision Number 

of regions 





Score combination 


Rank combination 




C * F = PeakSen * SISSRs 


0887166 


0887675 


13293 


F * F - risGpnnmp * SISSRs 




0 885596 


12841 


C. ^ ^ = P^BkSGC] * CisGsnoms 


0.90221 1 


0.892652 


12662 


C ^ D = PeakSen * OuEST 


09100S6 


0.920872 


1 1865 


Pt * C iCCT * CICCDr 

U r = UubbI bboKS 


U.y 1 1 U4o 


O.yUo/ /4 


10789 


D * E = Quest » CisGenome 


0.914799 


0.917028 


14452 


B • D = MACS » Quest 


0.938479 


0.937476 


14528 


B * C = MACS * PeakSeq 


0.941655 


0.948495 


12095 


B * F = MACS • SISSRs 


0.942113 


0.950036 


11003 


B * E = MACS * CisGenome 


0.949365 


0.948955 


14244 


A * B = TRLocator * MACS 


0.951392 


0.950802 


16921 


A » D = TRLocator * QuEST 


0.951877 


0.950939 


13270 


A * C = TRLocator » PeakSeq 


0.952759 


0.956961 


11573 


A * F = TRLocator * SISSRs 


0.959214 


0.960584 


10463 


A * E = TRLocator * CisGenome 


0.959687 


0.9591 1 1 


13155 



Table 3 Average precision for the union (+) of two methods. 


X + y 


Average precision 




Number 
of regions 




Score combination 


Ranl< combination 




E + F = CisGenome + SISSRs 


0.8114 


0.7997 


26371 


D + E = QuEST + CisGenome 


0.8190 


0.8158 


26457 


D + F = QuEST + SISSRs 


0.8204 


0.8038 


25574 


C + D = PeakSeq + QuEST 


0,8526 


0.8475 


22191 


C + F = PeakSeq + SISSRs 


0.8545 


0.8559 


22876 


C + E = PeakSeq + CisGenome 


0.8610 


0.8522 


24415 


B + F = MACS + SISSRs 


0.8880 


0.8950 


20767 


B + D = MACS + QuEST 


0.8883 


0.8876 


21242 


B + E = MACS + CisGenome 


0.8983 


0.8977 


20768 


B + C = MACS + PeakSeq 


0.8983 


0.9033 


19895 


A + F = TRLocator + SISSRs 


0.9126 


0.9030 


19673 


A + B = TRLocator + MACS 


0.9168 


0.9158 


20279 


A + D = TRLocator + QuEST 


0.9168 


0.9071 


20117 


A + E = TRLocator + CisGenome 


0.9193 


0.9178 


19720 


A + C = TRLocator + PeakSeq 


0.9199 


0.9177 


19281 



Table 4 Coverage for single methods. 



IVIethod Coverage; ranl< Number 

of regions 

F = SISSRs 9322; 6 20715 

E = MACS 11804 5 19918 

D = TRLocator 11 850; 4 19673 

C = CisGenome 14010; 3 21 190 

B = QuEST 14440; 2 21514 

A = PeakSeq 15611; 1 20000 



overlapping regions of two methods, improvement would 
take place if the merged region reaches a TSS missed by 
the regions before being merged. 

Conclusions 

This study entails the evaluation of and selection among 
multiple detection systems for ChlP-seq peak identifica- 
tion. In order to do so, we use six well-known methods 
A = CisGenome, B = MACS, C = PeakSeq, D = QuEST, 
E = SISSRs, and F = TRLocator and obtain the regions 
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Table 5 Coverage for the intersection ( 


*) of two methods. 








Cover3C|6 




Number 








of regions 




Score combination 


Rank combination 




B * F = QuEST * SISSRs 


10016 


10016 


10789 


C * F = CisGenome * SISSRs 


11211 


1 121 1 


12841 


A * B = PeakSeq * QuEST 


1 1920 


1 1920 


1 1865 


A * C = PeakSec] * CisGenome 


12010 


12010 


12662 


E * F = MACS * SISSRs 


12351 


12351 


1 1003 


B * C = QuEST * CisGenome 


12459 


12459 


14452 


D * F = TRLocator * SISSRs 


12662 


12662 


10463 


A * F = PeakSeq * SISSRs 


12921 


12921 


13293 


A * E = PeakSeq * MACS 


13717 


13717 


12095 


A * D = PeakSeq * TRLocator 


13939 


13939 


11573 


B * E = QuEST * MACS 


14700 


14700 


14528 


B * D = QuEST * TRLocator 


14725 


14725 


13270 


C * E = CisGenome * MACS 


14947 


14947 


14244 


C * D = CisGenome * TRLocator 


15075 


15075 


13155 


D * E = TRLocator * MACS 


17725 


17725 


16921 


identified by each on a common ChlP-seq data set and 


(2) Average precision of union: All 2-combinations 


utilize the tag count as a score function representing each 


(except one) are negative cases. 




method. We define two methods to combine and rescore 


(3) Coverage of intersection: Some 2-combinations 


the regions of two systems, namely, union and intersec- 


are positive, while some are negative. 


tion. Average precision and TSS coverai 


ge are used to eval- 


(4) Coverage of union: All 2-combinations are posi- 


uate the performance of all 2-combinations of these six 


tive cases. 




systems. We summarize our results as follows: 


(5) In the case of coverage of intersection, 2-combina- 






tions D*E, C*D, and C*E are ranked #1, #2, and #3 


(1) Average precision of intersection: All 2-combina- 


among all 15 2-combinations, respectively. For the 


tions are positive cases 




coverage of union, 2-combinations A+C, C+F, B+C, A 


Table 6 Coverage for the union (+) of two methods. 






X + y 


Coverage 




Number 








of regions 




Score combination 


Rank combination 




A + E = PeakSeq + MACS 


18964 


18964 


19895 


A + D = PeakSeq + TRLocator 


19433 


19433 


19281 


C + E = CisGenome + MACS 


19458 


19458 


20768 


E + F = MACS + SISSRs 


19459 


19459 


20767 


A + B = PeakSeq + QuEST 


19520 


19520 


22191 


D + F = TRLocator + SISSRs 


19742 


19742 


19673 


B + E = QuEST + MACS 


19760 


19760 


21242 


C + D = CisGenome + TRLocator 


19767 


19767 


19720 


B + D = QuEST + TRLocator 


20014 


20014 


20117 


D + E = TRLocator + MACS 


20127 


20127 


20279 


B + F = QuEST + SISSRs 


21003 


21003 


25574 


A + F = PeakSeq + SISSRs 


21032 


21032 


22876 


B + C = QuEST + CisGenome 


21165 


21165 


26457 


C + F = CisGenome + SISSRs 


21360 


21360 


26371 


A + C = PeakSeq + CisGenome 


21738 


21738 


24415 
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Avei age Precision for tlie Intersection of Two Metliocls 





0.95 

0.9 

a. 
< 

0.85 
0.8 
0.75 

F E D C B A CF EF CE CD DF DE BD BC BF BE AB AD AC AF AE 

Method 

A=niLocator, B=MACS, C=PeakSeq, D=QiiEST, E=CisGenome, F=SISSRs 
Figure 5 Average precision for intersection of two methods 



-Score 
Combination 

-Rank 
Combination 



+F, and B+F are ranked #1, #2, #3, #4, and #5 among 
the 15 2-combinations, respectively. 

In summary, we have the following observations 
resulting from the above experiments: 

• There is no single answer as to the selection of avail- 
able methods (and systems) for ChlP-seq peak detection. 
It depends on the criteria (e.g. features) and performance 
evaluation (e.g. average precision or TSS coverage). 

♦ Combinations of different methods (systems) do 
improve results in many cases (average precision of 
intersection, coverage of union, some for coverage of 
intersection). Some combinations of lesser preferred sys- 
tems may outperform all other system combinations. 



• Average precision improved more when combining 
two systems by intersection and coverage improved 
more when two methods are combined by union. 

• The use of the rank function in our evaluation of 
multiple detection systems provides a generic framework 
to study the preference and relative preference for the 
method (or system) selection process. 

In our future work, we will explore conditions such as 
diversity between or performance ratio of two methods 
(systems) of which two or more systems should be com- 
bined to obtain a better system (positive cases). Future 
work also involves application of method combination to 
other proteins and transcription factors. As not all TSS 
may be annotated in the reference genome, identifying 



Average Precision for tlie Union of Two Metliods 




-Score 
Combination 

-Rank 
Combination 



-1 1 1 1 1 1 1 1 1 1 1 1 r 1 1 1 1 1 1 1 1 

F E D C B A EF DE DF CD CF CE BF BD BE BC AF AB AD AE AC 
Method 

A=TRLocator, B=MACS, C=PeakSeq, D=QuEST, E=CisGenome, F=SISSRs 
Figure 6 Average precision for union of two methods. 
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Coverage for the Intersection of Two Methods 
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-Score 
Combination 
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Method 



A=TRLocator, B=MACS, C=PeakSeq, D=QuEST, E=CisGenome, F=SISSRs 
Figure 7 Coverage for intersection of two methods. 



Coverage for tlie ITnioii of Two Methods 



25000 




-Score 

Combination 
-Rank 

Combination 



5000 
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Method 



A=TRLocaton B=MACS, C=PeakSeq, D=QuEST, E=CisGenome, F=SISSRs 

Figure 8 Coverage for union of two metliods 



high-scoring regions among multiple methods can also 
be used to suggest potential TSS. 
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